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Abstract 

With the advent of high-throughput profiling methods, interest in reverse engineering the structure 
and dynamics of biochemical networks is high. Recently an algorithm for reverse engineering of 
biochemical networks was developed by Laubenbacher and Stigler. It is a top-down approach using 
time discrete dynamical systems. One of its key steps includes the choice of a term order. The 
aim of this paper is to identify minimal requirements on data sets to be used with this algorithm 
and to characterize optimal data sets. We found minimal requirements on a data set based on how 
many terms the functions to be reverse engineered display. Furthermore, we identified optimal data 
sets, which we characterized using a geometric property called "general position". Moreover, we 
developed a constructive method to generate optimal data sets, provided a codimensional condition 
is fulfilled. In addition, we present a generalization of their algorithm that does not depend on the 
choice of a term order. For this method we derived a formula for the probability of finding the 
correct model, provided the data set used is optimal. We analyzed the asymptotic behavior of the 
probability formula for a growing number of variables n (i.e. interacting chemicals). Unfortunately, 
this formula converges to zero as fast as r 9 ™, where q € N and < r < 1. Therefore, even if 
an optimal data set is used and the restrictions in using term orders are overcome, the reverse 
engineering problem remains unfeasible, unless prodigious amounts of data are available. Such 
large data sets are experimentally impossible to generate with today's technologies. 

Keywords: Reverse engineering, data requirements, biochemical networks, time discrete dynami- 
cal systems, orthogonality 



1 Introduction 

Since the development of multiple and simultaneous measurement techniques such as mi- 
croarray technologies, reverse engineering of biochemical and, in particular, gene regulatory 
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networks has become a more important problem in systems biology . One well-known re- 
verse engineering approach are the top-down methods, which try to infer network properties 
based on the observed global input-output-response. The observed input-output-response is 
usually only partially described by available experimental data. 

Depending on the type of mathematical model used to describe a biochemical process, 
a variety of top-down reverse engineering algorithms have been proposed (De Jong, 2002), 
(D'haeseleer et al. , 2000), (Gardner & Faith, 2005). Each modeling paradigm presents differ- 
ent requirements relative to quality and amount of the experimental data needed. Moreover, 
for each type of model, a suitable mathematical framework has to be developed in order 
to study the performance and limitations of reverse engineering methods. For any given 
modeling paradigm and reverse engineering method it is important to answer the following 
questions: 

1. What are the minimal requirements on data sets? 

2. Can data sets be characterized in such a way that "optimal" data sets can be identi- 
fied? (Optimality meaning that the algorithm performs better using such a data set 
compared to its performance using other data sets.) 

The second question is related to the design of experiments and optimality is characterized 
in terms of quantity and quality of the data sets. 

(Laubenbacher & Stigler, 2004) developed a top-down reverse engineering algorithm for 
the modeling paradigm of time discrete finite dynamical systems. Herein, we will refer to 
it as the LS-algorithm. They apply their method to biochemical networks by modeling the 
network as a time discrete finite dynamical system, obtained by discretizing the concentration 
levels of the interacting chemicals to elements of a finite field. One of the key steps of the 
LS-algorithm includes the choice of a term order. The modeling paradigm of time discrete 
finite dynamical systems generalizes the Boolean approach (Kauffman, 1993) (where the field 
only contains the elements and 1). Moreover, it is a special case of the paradigm described 
in (Thomas, 1991). 

Some aspects of the performance of the LS-algorithm were studied by (Just, 2006) in a 
probabilistic framework. 

In this paper we investigate the two questions stated above in the particular case of the 
LS-algorithm. For this purpose, we developed a mathematical framework 1 that allows us 
to study the LS-algorithm in depth. Having expressed the steps of the LS-algorithm in our 
framework, we were able to provide concrete answers to both questions: First, we found 
minimal requirements on a data set based on how many terms the functions to be reverse 
engineered display. Second, we identified optimal data sets, which we characterize using a 
geometric property called "general position". Moreover, we developed a constructive method 
to generate optimal data sets, provided a codimensional condition is fulfilled. 

In addition, we present a generalization of the LS-algorithm that does not depend on the 
choice of a term order. We call this generalization the term- order- free reverse engineering 
method. For this method we derive a formula for the probability of finding the correct 



This framework is based on a general linear algebraic result stated in (Dclgado-Eckcrt, under review). 
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model 2 , provided the data set used satisfies an optimality criterion. Furthermore, we analyze 
the asymptotic behavior of the probability formula for a growing number of variables n 
(i.e. interacting chemicals). Unfortunately, this formula converges to zero as fast as r q " , 
where q G N and < r < 1. Consequently, we conclude that even if an optimal data 
set is used and the restrictions imposed by the use of term orders are overcome, the reverse 
engineering problem remains unfeasible, unless experimentally impracticable amounts of data 
are available. This result discouraged us from including in this paper any computational and 
algorithmic aspects of the term-order-free reverse engineering method. 

In contrast to (Just, 2006), we focus here on providing possible criteria for the design of 
specific experiments instead of assuming that the data sets are generated randomly. More- 
over, we do not necessarily assume that information about the actual number of interactions 
in the biochemical network is available. 

The organization of this article is the following: 

Section 2 is devoted to the mathematical background: We briefly describe the LS- 
algorithm and provide a mathematical framework to study it. Moreover, we introduce the 
term-order-free reverse engineering method. We finish the section with a clear formulation 
of the questions studied in this paper. Section 3 presents rigorous results and some of their 
consequences. In Section 4 we summarize our main results, discuss their consequences and 
provide further conclusions. 

To fully understand the technical details of our analysis, very basic knowledge in linear 
algebra and algebra of multivariate polynomials is required. We refer the interested reader 
to (Golan, 2004) and (Cox et al. , 1997). 



2 Mathematical background 

2.1 A short description of the LS-algorithm 

In the modeling paradigm described by (Laubenbacher & Stigler, 2004), a biological or 
biochemical system described by n varying quantities is studied by taking m consecutive 
measurements of each of the interacting quantities. This yields one time series 

Such series of consecutive measurements are repeated t times starting from different initial 
conditions, where the length m k of the series may vary. At the end of this experimental 
procedure, several time series are obtained: 

sli, sl mi 
ski, •••) sk mfc 



st \ , . . . , st 



rn f 



We will give a precise definition of " correct model" . 
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Each point in a time series is a vector in W 1 . Time series are then discretized using a 
discretization algorithm that can be expressed as a map 

D : R n -> S n (1) 

where the set S is a finite field of cardinality p := \S\ (the cardinality of the field used is 
determined during the discretization process). The discretized time series can be written as 

dki := D(ski), dk mfc := D(sk m J, k = 1, t 

One fundamental assumption made in their paper is that the evolution in time of the dis- 
cretized vectors obeys a simple rule, namely, that there is a function 

F : S n -> S n 

such that 

dkj+i = F(dkj) for i = 1, m fe - 1, k = 1, t (2) 

(Laubenbacher & Stigler, 2004) call F the transition function of the system. One key ingre- 
dient in the LS-algorithm is the fact that the set S is endowed with the algebraic structure 
of a finite field. Under this assumption, the rule (2) reduces to a polynomial interpolation 
problem in each component, i.e. for each j e {1, ...,n} 

dk( i+ i)j = F j (dk i ) for k = 1, t, % = 1, m k - 1 (3) 

The information provided by the equations (3) usually underdetermines the function 
Fj : S n — > S, unless for all possible vectors x e S*™, the values i^-(x) are established by 
(3). Indeed, any non-zero polynomial function that vanishes on all the data inputs 

X := {dki | k = 1, ...,t, i = 1, ...,m k - 1} 

could be added to a function satisfying the conditions (3) and yield a different function 
that also satisfies (3). Among all those possible solutions, the LS-algorithm chooses the 
most parsimonious interpolating polynomial function Fj : S n — > S according to some chosen 
term order. To generate the most parsimonious function the algorithm first takes as input 
the discretized time series and generates functions fj, j = l,...,n that satisfy (3) for each 
j G {1, n} correspondingly. Secondly, it takes a monomial order <j as input and generates 
the normal form of fj with respect to the vanishing ideal I{X) and the given order < . For 
every j e {1, ■■■,n}, this normal form is the output Fj of the algorithm. 
We also refer to 2.1 in (Just, 2006) for another rigorous description of the LS-algorithm. 

2.2 A mathematical framework to study the reverse engineering 
problem 

The mathematical framework presented here is based on a general result stated in (Delgado- 
Eckert, under review). This framework will allow us to study the LS-algorithm as well as a 
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generalized algorithm of it that is independent on the choice of term orders. 

We start with the original problem: Given a time-discrete dynamical system over a finite 

field S in n variables 

F : S n -> S n 

and a data set X C S n generated by iterating the function F starting at one or more 
initial values, what are the chances of reconstructing the function F if the LS-algorithm or 
a similar algorithm is applied using X as input time series? 3 Since the algorithms studied 
here generate an output model G : S n — > S n by calculating every single coordinate function 
Gi : S n — > S separately, we will focus on the reconstruction of a single coordinate function 
F{ which we will simply call /. We will use the notation F q for a finite field of cardinality 
q G N. In what follows, we briefly review the main definitions and results stated and proved 
in (Delgado-Eckert, under review): 

We denote the (/"-dimensional vector space of functions g : F™ — > F q with F n (F q ). A basis 
for F n (F q ) is given by all the monomial functions ~x a : = x" 1 • ... • x^ n where the exponents 
ctj are non-negative integers satisfying aij < q. The set of all those monomial functions is 
denoted with (g n qa)a£M™, where M™ := {a G (N ) n | otj < q V j G {1, ■■■,n}} . We call those 
monomial functions fundamental monomial functions. 

Theorem 1 (and Definition) Let F q be a finite field and n, m G N natural numbers with 
m < q n . Further let 

X:=(x 1 ,...,x m )e(F n q ) m 
be a tuple of m different n-tuples with entries in the field F q . Then the mapping 



: F n (F q )^F 
f » ^(/):=(/(fO,..,/(f m )) 



m 

t 



is a surjective linear operator. <J>^ is called the evaluation epimorphism of the tuple X. 

For a given set X C F™ of data points, the interpolation problem of finding a function 
g G F n (F q ) with the property 

g(xi) = bi V i G {1, m}, x,Gl 

can be expressed using the evaluation epimorphism as: Find a function g G F n (F ? ) with the 
property 

®x(9) = b (4) 

Since a basis of F n (F g ) is given by the fundamental monomial functions (g n qa)aeM™, the 
matrix 

A := (<S>x(g ng a))aeM2 & M(m x q n ;F q ) 



3 From an experimental point of view the following question arises: What is the function F in an ex- 
perimental setting? Contrary to the situation when models with an infinite number of possible states are 
reverse engineered (see 1.2 in (Ljung, 1999)), there is a finite number of experiments that could be, at least 
theoretically, performed to completely characterize the system studied. In this sense, even in an experimental 
setting, there is an underlying function F. The components of this function is what (Just, 2006) called h trU e- 
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representing the evaluation epimorphism $^ of the tuple X with respect to the basis 
{9nqa)aeM« of F n (F q ) and the canonical basis of F™ has always the full rank m = min(m, q n ). 
That also means, that the dimension of the ker(<3>^) is 

dim(ker(<3>^)) = dim(F n (F g )) - m = q n - m (5) 

In the case m < q n where m is strictly smaller than q n = |F"| we have dim(ker($^)) > 
and the solution of the interpolation problem is not unique. There are exactly g dim ( ker (*x)) 
different solutions which constitute an affine subspace of F n (F q ). Only in the case m = q n , 
that means, when for all elements of F™ the corresponding interpolation values are given, the 
solution is unique. If the problem is underdetermined and no additional information about 
properties of the possible solutions is given, any algorithm attempting to solve the problem 
has to provide a selection criterion to pick a solution among the affine space of possible 
solutions. The LS-algorithm chooses the most parsimonious interpolating polynomial func- 
tion according to some chosen term order. A more geometric approach to pick one solution 
would be to select the solution that is perpendicular (or orthogonal) to the affine space of 
solutions. As stated in Remark and Theorem 32 of (Delgado-Eckert, under review), the 
solution selected by the LS-algorithm is precisely the orthogonal solution. For orthogonality 
to apply, a generalized inner product has to be defined on the space F n (F q ). We finish this 
subsection reviewing this concepts (cf. (Delgado-Eckert, under review)). 
The space F n (F q ) is endowed with a symmetric bilinear form (•, •) : V x V — > K, i.e. a 
generalized inner product. Orthogonality and orthonormality are defined as in an Euclidean 
vector space. 

For a given set X C F™ of data points, consider the evaluation epimorphism $^ of the tuple 

X and its kernel ker(<£>^). Now, let (ui, ...,w s ) be a basis of ker($^) C F n (F q ). By the basis 
extension theorem, we can extend the basis (ui, ...,u s ) to a basis 

(Ui, ...,u s ,u s+1 , ...,u d ) 

of the whole space F n (F q ). (There are many possible ways this extension can be performed. 
See more details below). As in example 6 of (Delgado-Eckert, under review), we can construct 
a generalized inner product on F n (F q ) by setting 

(ui,Uj) := 6ij V i,j e {1, ...,d} 

The orthogonal solution of (4) is the solution v* e F n (F q ) that is orthogonal to ker($^), 
i.e. it holds $^(f*) = b and for an arbitrary basis (wi, w s ) of ker(T) the following 
orthogonality conditions hold 

(wi,v*) = OVie{l,...,s} 
The way we extend the basis (u±, ...,u s ) of ker($^) to a basis 



(Ui, ...,U s ,U s+ i, ...,u d ) 
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of the whole space F n (F g ) determines crucially the generalized inner product we get by 
setting 



Consequently, the orthogonal solution of $x(g) = b may vary according to the chosen 
extension u s+ i, Ud G F n (F q ). In (Delgado-Eckert, under review) a systematic way to 
extend the basis (ui,...,u s ) to a basis for the whole space is introduced. With the basis 
obtained, the process of defining a generalized inner product according to (6) is called the 
standard ortho-normalization. This is because the basis (ui, u s , u s +i, ...,Ud) is orthonormal 
with respect to the generalized inner product defined by (6). 

As shown in Section 5 of (Delgado-Eckert, under review), using the generalized inner product 
obtained by applying the standard orthonormalization, the functions generated by the LS- 
algorithm are orthogonal solutions of the polynomial interpolation problem as formulated 
in (4). Under these assumptions the orthogonal solution is also unique (see theorem 9 in 
(Delgado-Eckert, under review)). 

The standard orthonormalization process depends on the way the elements of the basis 
(9nqa)a£M™ of fundamental monomial functions are ordered. If they are ordered according to 
a term order, the calculation of the orthogonal solution of (4) yields the same result as the 
LS-algorithm. If more general linear orders are allowed, a more general algorithm emerges 
that is not restricted to the use of term orders. This algorithm can be seen as a generalization 
of the LS-algorithm. We call it the term- order-free reverse engineering method. The precise 
definition of the standard orthonormalization procedure is stated in Section 4 of (Delgado- 
Eckert, under review). In the appendix we summarize the steps of the term-order-free reverse 
engineering method. 

2.3 The questions studied in this paper 

The mathematical framework developed in the previous subsection will allow us to answer 
the following questions regarding the LS-algorithm and its generalization, the term-order-free 
reverse engineering method: 

Problem 2 Given a function f G F n (F g ), what are the minimal requirements on a set 
X C F™, such that the LS-algorithm reverse engineers f based on the knowledge of the 
values that it takes on every point in the set X ? 

Problem 3 Are there sets X C F" that make the LS-algorithm more likely to succeed in 
reverse engineering a function f G F n (F q ) based only on the knowledge of the values that it 
takes on every point in the set X? 4 

Problem 4 Given a function f G F n (F q ) and an optimal set X C F™ (in the sense of the 
previous problem). If the term order used by the LS-algorithm is chosen randomly, can the 
probability of success be calculated? If the linear order used by the term- order- free method is 
chosen randomly, can the probability of success be calculated? 




(6) 



4 A solution to this problem would provide criteria for the design of experiments. 
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Problem 5 What is the asymptotic behavior of the probability for a growing number of 
variables n ? 

It is pertinent to emphasize that, contrary to the scenario studied in (Just, 2006), we do 
not necessarily assume that information about the number of variables actually affecting / 
is available. We will give further comments on this issue at the end of the conclusions. 

3 Results 

3.1 Basic definitions and facts 

For what follows recall that M q = {a G (N ) n | atj < q V j G {1, n}} . 

Lemma 6 (and Definition) Let K be a field, n, q G N natural numbers and K[t\, r n ] 
the polynomial ring in n indeterminates over K. Then the set of all polynomials of the form 

a a T?K..7?eK[T 1 ,...,T n ] 

aeM™ 

with coefficients a a G K is a vector space over K. We denote this set with 
P?(K)cK[ Tl ,...,T n ]. 

Theorem 7 Let F q be a finite field and n G N a natural number. Then the vector spaces 
Pq(F q ) and F n (F q ) are isomorphic via the mapping 

cp : Pq(F q ) — > F n (F q ) 

g = ^r-C 1 ^v(gW) ■■= Yl aa ~* a 

Definition 8 Let K be a field, n,m G N natural numbers and K[t\, ...,r n ] the polynomial 
ring in n indeterminates over K. Furthermore, let gi, g m G K[ti, r n ] be polynomials. 
The set 

(gi,--,g m ) ■= {hgi + -h m g m \ h 1 ,...,h m G K[r 1 ,...,T n ]} 
is called the ideal generated by g 1 , ...,g m . 

For a tuple x = (x±, x n ) we write x := {x±, x n } for the set containing all the entries 
in the tuple x. 

3.2 Conditions on the data set 

Definition 9 Let f G F n (F q ) be a polynomial function. The subset of F q containing all 
values on which the polynomial function f vanishes is denoted by 

v{ v -\f)) 

where y? is the mapping defined in theorem (7). 
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The following result tells us that if we are using the LS-algorithm to reverse engineer 
a nonzero function we necessarily have to use a data set X containing points where the 
function does not vanish. 

Theorem 10 Let f G F n (F,j)\{0} be a nonzero polynomial function. Furthermore let 

X:= (f 1 ,...,f m )G(F;r 
be a tuple of m different n-tuples with entries in the field F q , b G F™ be the vector defined by 

h ■■= f(xi), i = 1, ...,m 
and v*the orthogonal solution of § x (g) = b.Then if v* = f it follows 5 

v^-\f)ynx^0 

Proof. If V(ip~ 1 (f)) c nX = then by definition of V((p~ 1 (f)), the vector b would be equal 
to the zero vector 0. From Corollary 11 in Subsection 2.2 of (Delgado-Eckert, under review) 
we know that the orthogonal solution i>*of ®x(g) = is the zero function, thus v* ^ /. ■ 

Theorem 11 Let f G F n (F 9 )\{0} be a nonzero polynomial function. Furthermore let 

X:=(f 1 ,...,f m )G(F^r 

be a tuple of m different n-tuples with entries in the field F q , b G F™ be the vector defined by 

k ■■= f(xi), i = 1, ...,m 

andv*the orthogonal solution of$ x (g) = b. In addition, assume V(tp~ l (f)) c fl X ^ 0. Then 
it holds 

v* = f f e span(u s+ i, ...,u d ) 

Proof. The claim follows directly from the definition of orthogonal solution and its 
uniqueness. ■ 

Remark 12 From the necessary and sufficient condition 

f G span(u s+1 , ...,u d ) (7) 

it becomes apparent, that if the function f is a linear combination of more than 
d — s = m fundamental monomial functions, f can not be found as an orthogonal solu- 
tion v* of $x(g) = b. In particular, if f is a linear combination containing all d fundamental 
monomial functions in (g nq a)aeM™, no proper subset IcFJ of F™ will allow us to find f 

as orthogonal solution of& x (g) = b (where bi := /(xj), Xi G X). 
5 If A is a set, A c denotes its complement 
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Remark 13 From the condition (7) follows that it is necessary that a monomial function 
appearing in f is linearly independent of the basis vectors u\,...,u s o/ker($^). For this 
reason, the set X should be chosen in such a way that no fundamental monomial function 
(g n qa)aeM™ is linearly dependent on the basis vectors ui, ...,u s of ker(^^). Otherwise, some of 
the terms appearing in f might vanish on the set X and wouldn't be detectable by any reverse 
engineering method, (Laubenbacher & Stigler, 2004)- This problem introduces a more general 
question about the existence of vector subspaces in "general position" : 

Definition 14 Let W be a finite dimensional vector space over a finite field F q with 
dim(W) = d > 0. Furthermore, let (wi,...,w d ) be a fixed basis of W and s G N a natu- 
ral number with s < d. A vector subspace U C W with dim(C7) = s is said to be in general 
position with respect to the basis (wi, ...,Wd) if for any basis (vi, ...,v s ) ofU and any injective 
mapping 

7r:{l,..,(d-s)}-{l,..,t0 

the vectors 

V 1 , ...,V a ,W n (i), ...,UV( d _ s ) 

are linearly independent. 

It can be shown, that if the cardinality q of the finite field F q is sufficiently large, proper 
subspaces in general position of any positive dimension always exist. The proof is provided 
in the appendix. 

Now assume that ker(<E»^) is in general position with respect to the basis (g n qa)aeM™ of 
F n (F q ). Following the basis extension theorem and due to the general position of ker(<£>^), 
we can extend the basis (ui, ...,u s ) of ker($^) to a basis 

...,u s ,u s+1 , ...,u d ) 

of the whole space F n (F q ), where {u s+ i, ...,u d } C {g n qa}a£M™ is any subset with d — s 
elements of {g nq a}a£M™- Now we can construct a generalized inner product on F n (F q ) by 
setting 

{ui,Uj) := 6ij Vij'e {1, ...,d} 

The advantage in this situation is that there is no bias imposed by the data on the 
monomial functions that can be used to extend the basis (ui,...,u s ) to a basis of F n (F q ), 
i.e. there are no restrictions on the structure of ker($ J( ;)- L . In addition, having this degree 
of freedom, it is possible to calculate the exact probability of success of the method based 
on the number of fundamental monomial functions actually contained in /. We will give 
an explicit probability formula in the next Subsection. For our further analysis we need the 
following intermediate result, whose proof is left to the reader: 

Lemma 15 (and Definition) Let F q be a finite field, n, s G N natural numbers with 
s < dim(F n (F 9 )). Furthermore, let U C F n (F q ) be an s-dimensional subspace. Then the 
set 

V(U) :=V((v- 1 (u 1 ),...,^ 1 (u s )))CF n q 
where (ui,...,u s ) is any basis of U is independent on the choice of basis and it's called the 



Reverse Engineering Finite Dynamical Systems 



11 



variety of the subspace U. 

Now the following question arises: How should the set X be chosen in order to have 
ker($^) in general position with respect to the basis (g n qa)aeM^ For a given natural number 
s < d : = dim(F n (F g )) the idea is to start from a basis (ui,...,u s ) of a vector subspace 
U C F n (F q ) in general position with respect to the basis (g n qa)a£M™- The next step is to 
calculate the variety 



where m := \Y\ . We know from Remark 23 in Subsection 3.2 of (Delgado-Eckert, under 
review) that dim(ker($ i ?)) = dim(F n (F,j)) — \Y\ = d — m. Now, in general, for the kernel 
ker($y>) of the corresponding evaluation epimorphism $^ it holds 



and therefore s < dim(ker($ i ?)) = d — m, i.e. m < d — s. Now, the ideal scenario would be 
the case ker($ 1 ?) = U, i.e. m = d — s. A less optimistic scenario is given when U C ker($ i ?) 
is a proper subspace of ker($ i ?). In such a situation, ideally we would wish for ker($ i ?) to be 
itself in general position with respect to the basis (g n qa)aeM™- This issues raise the following 
question: 

When does there exist a subspace U C F n (F q ) in general position with respect to the basis 
{.9nqa)aeM n with dim(C/) < dim(F n (F g )) that in addition satisfies 



This is an interesting question that requires further research. It is related to whether the 
subspace U is an ideal of F n (F q ), when F n (F q ) is seen as an algebra with the multiplication of 
polynomial functions as the multiplicative operation. In the Appendix we provide examples 
in which two subspaces, both in general position, show a different behavior regarding the 
condition (8). We formalize this property: 

Definition 16 Let U C F n (F q ) be a subspace and (ui,...,u s ) an arbitrary basis of U . U is 
said to satisfy the codimension condition if it holds 



where codim{U) := dim(F n (F g )) — dim(C/). 

A subspace U C F n (F q ) in general position with respect to the basis (g n qa)a£M« that 
satisfies the codimension condition allows for the construction of an optimal set for use with 
the LS-algorithm. The set Y := (/9 _1 (m s ))) (where u 1: ... : u s is a basis of U) 

has namely the property ker($^) = U, i.e. ker($ i >) is in general position with respect to the 
basis (g n qa)aeM™- In other words, subspaces in general position that satisfy the codimension 
condition provide a basic component for a constructive method for generating optimal data 
sets. More generally we define: 




U C ker($ i? ) 




(8) 



codim{U) = \V((cp- l ( Ul ),...,v- l (u s )))\ 
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Definition 17 A set X C F™ snc/i £/ia£ ker($^) is in general position with respect to the 
basis (g n qa)aeM™ is referred to as optimal. 

Remark 18 (and Definition) Additional study is required to prove whether optimal data 
sets exist in general. (See the Appendix for concrete examples.) However, if no optimal sets 
can be determined, it is still advantageous to work with a data set X that was obtained as 
<y9~ 1 (-u s ))), where (u 1 ,...,u s ) is a basis for a subspace U in general position 
with respect to the basis (g nqa ) a eM™- In this case, at least U C ker(<£>^) still holds and it 
might be that the dimensional difference between U and ker($ 5 ?) is small. We call such data 
sets pseudo-optimal. 



3.3 Probabilities of finding the original function as the orthogonal 
solution 

Theorem 19 LetF q be a finite field, n,m G N natural numbers withm < dim(F n (F 9 )) =: d. 
Furthermore, let f G F„(F g )\{0} be a nonzero function consisting of a linear combination 
of exactly t fundamental monomial functions and 

X:=(x 1: ...,x m )e(F n q ) m 

a tuple of m different n-tuples with entries in the field F q such that X is optimal. Now let 
b G F™ be the vector defined as 

h := f(ifi), i = 1, ...,m 

s := dim(ker($^)) = d — m (cf. (5)), (ui,...,u s ) a basis for ker($^) and 
{u s+ i, Ud} C {g n qa}a£M n an arbitrary subset containing d — s elements. Then the prob- 

— * 

ability P that the orthogonal solution g* of^^(g) = b with respect to the generalized inner 
product 

(ui,Uj) := Sij Vij'G {1, ...,d} 

fulfills f = g* is given by 



q n - t 
q n — m 



ift<m (9) 



and 



P = ift>m 

Proof. Due to the definition of general position, there are exactly 

(--•Kd5^w-('--)'Gi.)-(---)'(!: 
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different ways to extend a basis (ui,...,u s ) of U to a basis of F n (F q ) using m = d 
fundamental monomial functions. If t < m, among such extensions, only 

use the t fundamental monomial functions appearing in /. Now (9) follows immediately. If, 
on the other hand, t > m, the number of fundamental monomial functions usable to extend a 
basis (wi, u 8 ) of ker(<3>^) to a basis of F n (F q ) is too small and ker($^)- L is not big enough 
to generate /. ■ 

Remark 20 If the elements in the basis (g n qa)aeM™ are ordered in a decreasing way ac- 
cording to a term order (the biggest element is at the left end, the smallest at the right end 
and position t means counting t elements from the right to the left) an analogous probability 
formula would be 

Number of arrangements that place the mon. functions in f after position s 

Total number of arrangements 

where an arrangement is an order of the elements of (g n qa)aeM™ that obeys a term order. 
(Two different term orders could generate the same arrangement of the elements in the finite 
se t {gnqa}aeM™)- So, for instance, if f contains a term involving the monomial function 
xf -1 ■ ... -a^ -1 , then the above probability (10) would be equal to zero, since every arrangement 
of the elements in {g n qa}aeM™ that obeys a term order would make that monomial function 
biggest. (It is inherent to term orders to make some monomial functions always biggest). 
In more general terms, it is difficult to make estimates about the numbers involved in (10). 
This shows some of the disadvantages of using term orders. 

Remark 21 Since for relatively small n and q the number d := q n is already very large, it is 
obvious that one should calculate the asymptotic behavior of the probability formula (9) for 
d — > oo. Indeed, we have with t <m 



< 



d-t\ {d-t)\ 



d — m)_ (d — m)\(m — t))\ 



d\ " d\ 



m) m\(d-m)\ 



{d-t)\m\ (d-t)\m\ 
(m-t)\d\ ~ d\ 
ml 

d(d-l)...(d-t + l) ~ 



for d — > oo 



If we write the amount of data used in proportion to the size d = q n of the space F r q \ and 
the number of terms displayed by f relative to the size q n of the basis (g n qa)aeM™, it becomes 
apparent how quickly the probability formula converges to for d — > oo. So let r := m/d 
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and 7 := d — t. Then we would have 



d-t 

P \d-m) {d-t)\m\ 



In particular, it holds 



~ d r d\ d \ r d (m-t)\d\ 

m(m - l)...(m - t + 1) _ rd{rd - l)...(rd -t+1) 
r d d{d-l)...(d-t + l) ~ r d d(d-l)...(d-t + l) 
rdrd(l - i)...rd(l - _ r*d*(l - £)...(! - 



r-dd(l-I)...d(l-i=l) " r-d*(l -J)...(l - ^ 



^(i-A)-(i-^r) 

^(i-S-d- 4 ?) 



d-t 
d — rd 

~dT 

rd 



r* for big d 



This expression shows in a straightforward way how big the proportional amount of data 
should be in order to have an acceptable confidence in the obtained result. It also shows that 
for t close to d the probability is very low and the reverse engineering not feasible. Usually 
no information about t is available, so it is advisable to work with the maximal t, namely 
d — 1 or with an average value for t. 



4 Conclusions 



The results we have obtained in the previous section provide guidelines on how to design 
experiments to generate data to be used with the LS-algorithm for the purpose of reverse 
engineering a biochemical network. 

The following are minimal requirements on a set X C F" such that the LS-algorithm reverse 
engineers / based on the knowledge of the values that it takes on every point in the set X : 

1. If the LS-algorithm is used to reverse engineer a nonzero function / e F n (F 9 )\{0}, 
necessarily the data set X used must contain points were the function does not van- 
ish. In other words, not all the interpolation conditions must be of the type X{ i— > 
(Theorem 10). 

2. If the LS-algorithm is used to reverse engineer a function / e -F n (F g )\{0} displaying 
t different terms, it requires at least t different data points to completely reverse 
engineer / (Remark 12). 
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3. If / G F„(F 9 )\{0} is a polynomial function containing all p n possible fundamental 
monomial functions, no proper subset X C F™ of F™ will allow the LS-algorithm to 
find / (Remark 12). 

Our results also make possible the identification of optimal sets ICF" that make the 
LS-algorithm more likely to succeed in reverse engineering a function / e F n (F q ) based only 
on the knowledge of the values that it takes on every point in the set X. Optimal data sets 
IcFJ are characterized by the property that ker($^) is in general position with respect to 
the basis (g n qa)aeM™ (see Definitions 17 and 14). Their advantage is given by the fact that 
they do not impose constraints on the set of candidate terms that can be used to construct 
a solution. Summarizing we can say: 

1. Even though such sets can be constructed in particular examples (see Appendix), 
further research is required to prove their existence in general terms. 

2. If no optimal sets can be determined, it is still advantageous to work with pseudo- 
optimal data sets (see Remark and Definition 18). 

Since the identified optimal data sets are sets X C F™ of discretized vectors, in a real 
application, the optimal data set X has to be transformed back to a corresponding set 
X C M" of real vectors. This transformation can be performed using an "inverse" function 
of the discretization mapping (1). This "inverse" function has to be defined by the user, 
given the fact that discretization mappings are highly non-injective and by definition map 
entire subsets Z CW 1 into a single value zE F™. 

Having characterized optimal data sets, the next step in our approach was to provide an 
exact formula for the probability that the LS-algorithm will find the correct model under 
the assumption that an optimal data set is used as input. As stated in Remark 20, we 
weren't able to find such a formula for the LS-algorithm. The biggest difficulty we face is 
related to the use of term orders inherent to the LS-algorithm. We overcome this problem 
by considering a generalization of the LS-algorithm which we call the term-order-free reverse 
engineering method (see Appendix). This method not only allows for the calculation of the 
success probability but it also eliminates the issues and arbitrariness linked to the use of 
term orders (see Remark 20). In conclusion, our results on this issue are: 

1. It is still an open problem how to derive a formula for the success probability of the 
LS-algorithm when optimal data sets are used as an input and the term order is chosen 
randomly. As stated in Remark 20, one of the main problems here is related to the 
use of term orders inherent to the LS-algorithm. 

2. Let / G F n (F 9 )\{0} be a nonzero function consisting of the linear combination of 
exactly t fundamental monomial functions. If the linear order used by the term-order- 
free method is chosen randomly, the probability of successfully retrieving / using an 
optimal data set X of cardinality \X\ — m is given by (see Theorem 19) 



q n -t 
q n — m 

rn 



iit<m (11) 
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and 

P = if t > m 

3. Let d = q n be the cardinality of the space F™. Furthermore, let X be an optimal data 
set with cardinality \X\ = m and r := m/d (note that < r < 1). Then the asymptotic 
behavior of the probability formula (11) for d — > oo (i.e. for n — > oo) satisfies (see 
Remark 21) 

' d-t 



d — rd 

J 

rd 



w r for big d 



As a consequence of the latter, we conclude that even if an optimal data set is used and the 
restrictions imposed by the use of term orders are overcome, the reverse engineering problem 
remains unfeasible, unless experimentally impracticable amounts of data are available. 

Finally, we comment on one scenario identified in (Just, 2006). Specifically, in Conclusion 
4(a), (Just, 2006) makes the assumption that the wiring diagram of each of the underlying 
functions is known, i.e. the variables that actually affect the function / are known. Under 
this assumption, let k be the number of variables affecting /. If one could perform specific 
experiments such that for all possible values that the k variables can take the response of the 
network is measured, the function / would be uniquely determined. In this situation, reverse 
engineering / wouldn't imply making any choices among possible solutions. This raises the 
question of how many measurements are needed and how big this data set would be in 
proportion to the size q n of the space F™ of all possible states the network can theoretically 
display. The number of measurements needed is q k and therefore the proportion is equal to 

q k 1 



If k is small compared to n (which is generally assumed by (Just, 2006)), then the proportion 
would be conveniently small. In other words, in relative terms, it is worth performing 
the q k specific experiments. However, performing q k measurements might still be beyond 
experimental feasibility. 
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6 Appendix 

6.1 Examples of vector spaces in general position and the 
codimension condition 

Example 22 Let n = 2, q = 2 and consider the vector space F 2 (F 2 ) an its basis 
(<?22a:)aeMf = (^1^2,^1,^2,1) ordered according to the lexicographic order with X\ > x 2 . 
Furthermore let U := span{x\X 2 + Xi + x 2 + 1). The basis vector Ui := X\X 2 + Xi + x 2 + 1 has 
the coordinates (1,1,1,1)* with respect to the basis (<722a)aeM§- Therefore, U is in general 
position with respect to (<?22a) Q eMf ■ It ^ s eas U t° verify 

|\/«y9-Vi)))| = \{(x,y) eFj\ xy + x + y +1 = mod2}\ 
= |{(0,1),(1,0),(1,1)}| = 3 
= 2 2 - 1 = codimiU) 

As a consequence, the set X := {(0, 1), (1, 0), (1, 1)} constitutes an optimal data set to reverse 
engineer any function f e -F 2 (F 2 ) displaying no more than 3 terms. If the term-order- 
free reverse engineering method is used, the probability of successfully retrieving a nonzero 
function displaying 1 term would be 




= 0.75 



For a function displaying 2 terms P = 0.5 and 3 terms P = 0.25. 

Example 23 Let n = 2, q = 3 and consider the vector space F 2 (F 3 ) an its basis 
(5 , 23a)aeM| = ( x i x 2i x \ x 2 , ^i^ 2 , x\ , x 1 x 2 , x\ , X\ , x 2 , 1) ordered according to a total degree term 
order with x 1 > x 2 . Furthermore let U be the 8- dimensional subspace of F 2 (F 3 ) generated by 

U := span{x\x\ + x\x 2l x\x 2 + X\x\, X\x\ + xl, x\ + x±x 2 , X\X 2 + x\, x\ + xi, X\ + x 2 , x 2 + 1) 

The coordinate vectors of the generating vectors are 

«i := (1,1,0,...,0)* 
u 2 := (0,1,1,0,...,0)* 



« 8 :=(0,.., 0,1,1)* 
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By calculating the determinant of the matrices 



A 



3 ■' 



it 



j = l,...,9 



(where ej is the jth canonical unit vector of~F\), one can easily show that U is in general 
position with respect to (<?23a)aeMf- To determine the set V ((tp^ 1 (ui) , . . . , ip^ 1 (u$))) , we start 
solving the three last equations given by 



X2 + Xi 
Xi + x 2 
X2 + I-- 







-1 



= ^ Xi — 1 

x 2 = -l 



This system of equations has no solution in the set F§. Therefore 

V({<p- 1 (u 1 ),...,<p- 1 (u s ))) = ® 

Consequently, U does not satisfy the codimension condition and thus does not yield an optimal 
data set. 



6.2 Existence of vector subspaces in general position 

The proof is easy but quite technical. The basic idea of the proof is to treat the problem 
over the real numbers and then construct a solution over finite fields based on the existence 
of a solution over the real numbers. This last step takes advantage of the density of the 
rational numbers in the set of real numbers. 

We recall the definition of general position for vector spaces over a finite field: 

Definition 24 Let W be a finite dimensional vector space over a finite field F q with 
dim(W) = d > 0. Furthermore, let (wi,...,Wd) be a fixed basis of W and s G N a natu- 
ral number with s < d. A vector subspace U C W with dim(C7) = s is said to be in general 
position with respect to the basis (wi, ...,Wd) if for any basis (vi, ...,v s ) ofU and any injective 
mapping 

7r:{l,..,(d-s)}-{l,..,t0 

the vectors 

V 1 ,...,V a ,W n (i),...,W w ( d - 8 ) (12) 

are linearly independent. 

It can be easily shown that if the linear independence condition (12) holds for one basis 
of U, it holds for every other basis of U. 

Now we will construct an s-dimensional subspace U C W in general position with respect 
to a given basis of W, where s is an arbitrary natural number with s < d. For this purpose 
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we will find the coordinates with respect to the basis (w 1 , ...,w d ) of a basis of U. We denote 
the sought coordinates as follows 







/ x d +i\ 




^(s-l)dH-l^ 




,6 = 




i ' ' ' j Ss 




\x d J 




\ %2d ) 




\ ^ / 



The next step is to count all different injective mappings ir : {1, (d — s)} — > {l,...,d} 
as 7Ti, 7Tjy- For each 7Tj we consider the coordinate vectors £i, £ s , e^i), e^^ds) with 
respect to the basis (wi, ...,Wd) , where e} is the jth canonical unit vector of F^. Now, for 
i — 1, N we define the determinant functions 



nsd 



X 



£l) •") ^7Tj(l)) €"Ki(d—s) 

The linear independence condition 



where e} is seen as the jth canonical unit vector of I 
(12) is equivalent to 

Due to the structure of ^fi, e^i), e Wi ^- s )j an d by the Leibniz determinant formula 
we know that D n . are nonzero polynomial functions in the variables x±, ...,x s d and therefore 
nonzero analytic functions in W d with infinite radius of convergence, (in particular, contin- 
uous functions). Consequently, no D Wz can be identical to zero on any open subset of R sd . 
By the continuity of D ni we know that there is a non-empty open subset 0\ C W d such that 
D ni \oi 7^ 0. Using the same argument we know that there is a non-empty set O2 C 0\ open 



in M. such that D n2 \o 2 0- After applying this argument N times we identify a non-empty 
open subset N C R sd such that D n .\ ON ^ V j £ {1, ...,N}. Since the set Q sd is a dense 
subset of R sd , there is a point y G Oat with rational entries, i.e. yi G Q V Z G {1, s<i}. Let 

y = (t-'-'T - ) 



and c := rifc=i fc- Since y G 
determinants we also know 



Oat, we know ^(y) 7^ V i G {l,...,iV}. By the rules of 
D nt (cy)^0Vte{l,...,N} 



Moreover, cy has integer entries, i.e. cyi G Z V / G {1, For a sufficiently large prime 

number p, the entries cyi can be seen as elements of the finite field F p of integers modulo 
p. Therefore, the values cyi G F p , / = 1, sd can be used as the coordinates with respect 
to the basis (w±, ...,Wd) of a basis for an s-dimensional subspace U C W in general position 
with respect to the basis (wi, ...,Wd) of W, a vector space over the finite field F p . ■ 
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6.3 The term-order-free reverse engineering algorithm 

The input of the term-order-free reverse engineering algorithm is a set X C F™ of m < q n 
different data points, a list of m interpolation conditions 

e x 

and a linear order > for the elements of the basis (gnqa)a<=M™ of F n (F q ), (i.e. the elements 
of the basis are ordered decreasingly according to > ). The steps of the algorithm are as 
follows: 

1. Calculate the entries of the matrix 

A := (<f> x (g nga ))aeM2 e M(m x q n ;F q ) 

— * 

representing the evaluation epimorphism & x of the tuple X with respect to the basis 
(9nqa)aeM™ of F n (F q ) and the canonical basis of F™. 

2. Calculate a basis y-y, y s G F^ of ker(A). 

3. Extend the basis yi,...,y s of ker(A) to a basis (yi, y s , y s +i, yd) of F^ using the 
standard orthonormalization procedure. (See Section 4 of (Delgado-Eckert, under re- 
view)). 

4. Define a generalized inner product (., .) : F^ — > F g by setting 

(j7i,j7j) : = Sij Vij'e {1,...,^} 

and calculate the entries of the matrix 5 defined by 

5y := (ei,ej), i,j G {l,...,g n } 
where e,j is the j'th canonical unit vector of F^. 

5. The coordinate vector with respect to the basis {g n qa)aeM™ of the output function is 
obtained by solving the following system of inhomogeneous linear equations 

Az = b 
$Sz = 0, i = l,...,s 

The steps described above represent an intelligible description of the algorithm and are 
not optimized for an actual computational implementation. 
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